It’s been a while since the last technical update on the project, and I’m fully aware you were missing it, so it’s time to recap with a really cool announcement:
We finally made a self-hosted Bootstrappable TinyCC in RISC-V
Most of you probably remember I already backported the Bootstrappable TinyCC compiler, but I didn’t test it in a proper environment. Now we can confidently say it is able to compile itself, a “large” program that makes use of more complex C features than the ones I used in my tests.
All this work was done by Andrius Štikonas and myself. Janneke helped us a lot with the Mes-related parts, too. The work this time was pretty hard, honestly. Most of the things we did here are not obvious, even for C programmers.
I’m not used to these kinds of quirks of the C language. Most of them are really specific and related to the standards, and many others are just things that were missing. I hope the ones I chose to discuss here help you understand your computer better, as they did for me.
This is going to be a very long post, so here’s a ToC to help you out:
- Context
- Problems fixed
  - TinyCC misses assembly instructions needed for MesLibC
  - TinyCC’s assembly syntax is weird
  - TinyCC does not support Extended Asm in RV64
  - MesLibC main function arguments are not set properly
  - TinyCC says __global_pointer$ is not a valid symbol
  - Bootstrappable TinyCC’s casting issues
  - Bootstrappable TinyCC’s long double support was missing
  - MesCC struct initialization issues
  - MesCC vs TinyCC size problems
  - MesCC add support for signed shifting
  - MesCC switch/case falls-back to default case
  - Bootstrappable TinyCC problems with GOT
  - Bootstrappable TinyCC generates wrong assembly in conditionals
  - Support for variable length arguments
  - MesLibC use signed char for int8_t
  - MesLibC Implement setjmp and longjmp
  - More
- Reproducing what we did
- Conclusions
- What is next?
Context
There are many blog posts in the series where you can find some context about the project, and even a FOSDEM talk about it, but they all give a very broad explanation, so let’s focus on what we are doing right now.
Here we have Mes, a Scheme interpreter, that runs MesCC, a C compiler, that is compiling our simplified fork of TinyCC; let’s call that Bootstrappable TinyCC. That Bootstrappable TinyCC compiler then tries to compile its own code. It compiles its own code because its goal is to add more flags in each compilation, so it has more features in each round[1]. We do all this because TinyCC is way faster than MesCC and it’s also more complex, but MesCC is only able to build a simple TinyCC with few features enabled.
During all this process we use a standard library provided by the Mes project, we’ll call it MesLibC, because we can’t build glibc at this point, and TinyCC does not provide its own C standard library.
With all this well understood, this is the achievement:
We made MesCC able to compile the Bootstrappable TinyCC, using MesLibC, to an executable that is able to compile the Bootstrappable TinyCC’s codebase to a binary that works and has all the features we need enabled.[2]
The process affected all the pieces in the system. We added changes in MesCC, MesLibC and the Bootstrappable TinyCC.
Why is this important?
We have already talked at length about the bootstrapping issue, the trusting trust attack and all that. I won’t repeat that here. What I’ll do instead is be specific. This step is a big deal because it allows us to go way further in the chain.
All the steps before Mes were already ported to RISC-V, mostly thanks to Andrius Štikonas, who worked on Stage0-POSIX and the rest of the glue projects that are needed to reach Mes.
Mes had been ported to RISC-V (64 bit) by W. J. van der Laan, and some patches were added on top of it by Andrius Štikonas himself before our current effort started.
At this moment in time, Mes was unable to build our bootstrappable TinyCC in RISC-V, the next step in the process, and the bootstrappable TinyCC itself was unable to build itself either. This was a very limiting point, because TinyCC is the first “proper” C compiler in the chain.
When I say “proper” I mean fast and fully featured as a C compiler. In x86, TinyCC is able to compile old versions of GCC. If we manage to port it to RISC-V we will eventually be able to build GCC with it and with that the world.
In summary, TinyCC is a key step in the bootstrapping chain.
Problems fixed
This work can be easily followed in the commits in my TCC fork’s riscv-mes branch, and in my Mes clone’s riscv-tcc-boot branch. We are also identifying the contents of this blogpost in the git history by adding the git tag self-hosted-tcc-rv64 to both of my forks. We will try to keep both for future reference.
In Mes the process might be a little bit harder to follow because we sent most of the patches to Janneke and he merged them, so when we were about to release this post I continued from Janneke’s branch to avoid divergences (I had some problems with that before). In any case, the code is there, and searching by author (Andrius and myself) will guide you to the changes we did.
Many commits have a long message you can go read there, but this post was born to summarize the most interesting changes we did, and write them in a more digestible way. Let’s see if I manage to do that.
The following list is not ordered in any particular way, but we hope the selection of problems we found is interesting for you. We found more errors, but these are the ones we consider more relevant.
TinyCC misses assembly instructions needed for MesLibC
TinyCC is not like GCC: TinyCC generates binary code directly, with no assembly code in between. TinyCC has a separate assembler that doesn’t follow the path that C code follows.
It works the same in all architectures, but we can take RISC-V as an example: TinyCC has riscv64-gen.c, which generates the binary files, but the riscv64-asm.c file parses assembly code and also generates binary. As you can see, binary generation is somewhat duplicated.
In the RISC-V case, the C part had support for almost everything since my backport, but the assembler did not support many instructions (which, by the way, are supported by the C part).
MesLibC’s crt1.c is written in assembly code. Its goal is to prepare the main function and call it. For that it needs the jalr instruction and others that were not supported by TinyCC, neither upstream nor in our bootstrappable fork.
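To give an idea of what the assembler has to produce for these instructions, here is a minimal sketch (not TinyCC’s code; encode_itype and encode_jalr are hypothetical names) of how a RISC-V I-type instruction such as jalr is packed into 32 bits following the layout in the RISC-V spec. Relocations are left out: the sketch assumes the immediate is already known. Register numbers are the architectural ones (ra is x1, sp is x2, t0 is x5, a0 is x10).

```c
#include <stdint.h>

/* I-type layout from the RISC-V spec:
   imm[11:0] (bits 31..20) | rs1 (19..15) | funct3 (14..12) | rd (11..7) | opcode (6..0) */
static uint32_t encode_itype(uint32_t opcode, uint32_t funct3,
                             uint32_t rd, uint32_t rs1, int32_t imm)
{
    return ((uint32_t)(imm & 0xFFF) << 20)   /* 12-bit two's complement immediate */
         | ((rs1 & 0x1F) << 15)
         | ((funct3 & 0x7) << 12)
         | ((rd & 0x1F) << 7)
         | (opcode & 0x7F);
}

/* jalr rd, imm(rs1): opcode 0x67, funct3 0 */
static uint32_t encode_jalr(uint32_t rd, uint32_t rs1, int32_t imm)
{
    return encode_itype(0x67, 0, rd, rs1, imm);
}
```

For example, encode_jalr(1, 5, 0) gives 0x000280E7 (jalr ra, 0(t0)), and encode_jalr(0, 1, 0) gives 0x00008067, the familiar encoding of ret.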
These changes appear in several commits because I didn’t really understand how the TinyCC assembler worked, and some instructions need to use relocations, which I didn’t know how to add. The following commit can show how it feels to work on this, and shares how relocations are done:
There you can see we started to understand things in TinyCC, but some other changes came after this.
A very important note here is that upstream TinyCC does not support these instructions yet, so we need to patch upstream TinyCC when we use it, contribute the changes, or find another kind of solution. Each solution has its upsides and downsides, so we need to make a decision about this later.
TinyCC’s assembly syntax is weird
Following on from the previous fix, TinyCC does not support the GNU Assembler’s syntax in RISC-V. It uses a simplified assembly syntax instead.
Where in GNU assembler syntax we would write:
sd s1, 8(a0)
In TinyCC’s assembly we have to do:
sd a0, s1, 8
This requires changes in MesLibC, and it makes us create a separate folder for TinyCC in MesLibC. See lib/riscv64-mes-tcc/ and lib/linux/riscv64-mes-tcc for more details.
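The operand reordering is mechanical, so it can be sketched as a small text rewrite. gnu_store_to_tcc is a hypothetical helper, not part of Mes or TinyCC; it only handles the store form shown above (op rs2, offset(rs1)) and assumes the other store mnemonics follow the same TinyCC pattern as the sd example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: rewrite a GNU-style store "OP RS2, OFF(RS1)" into
   the TinyCC-style "OP RS1, RS2, OFF" order described above.
   Returns 0 on success, -1 if the input is not a store in that form. */
static int gnu_store_to_tcc(const char *in, char *out, size_t outlen)
{
    char op[8], rs2[8], rs1[8];
    int off;

    /* "sd s1, 8(a0)" -> op="sd", rs2="s1", off=8, rs1="a0" */
    if (sscanf(in, "%7s %7[^,], %d(%7[^)])", op, rs2, &off, rs1) != 4)
        return -1;
    if (strcmp(op, "sb") && strcmp(op, "sh") && strcmp(op, "sw") && strcmp(op, "sd"))
        return -1;
    snprintf(out, outlen, "%s %s, %s, %d", op, rs1, rs2, off);
    return 0;
}
```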
TinyCC does not support Extended Asm in RV64
Way later in time we also found TinyCC does not support Extended Asm in RV64. The functions that manage that are simply empty.
It took us some time to realize what was going on here, for two reasons. First, there are few cases of Extended Asm in the code we were compiling. Second, it was failing silently.
Extended Asm is important because it lets you tell the compiler you are going to touch some registers in the assembly block, so it can protect variables and apply optimizations properly.
In our case, our assembly blocks were clobbering some variables that would have been protected by the compiler if Extended Asm support had been implemented.
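For readers who haven’t used it, this is roughly what Extended Asm looks like in GCC. add_one is a hypothetical example: the asm only builds on RISC-V, and the plain C fallback keeps it runnable elsewhere. The operand lists tell the compiler which values the block reads and writes, and the "t0" clobber tells it the block overwrites t0, so it won’t keep x or the result there. A compiler that silently drops this information, as TinyCC’s RV64 support did, can let the assembly destroy live values.

```c
/* Hypothetical example of GCC Extended Asm (RISC-V only, C fallback elsewhere). */
static long add_one(long x)
{
#if defined(__riscv)
    long r;
    __asm__ (
        "li   t0, 1\n\t"
        "add  %0, %1, t0"
        : "=r"(r)      /* output: a register the block writes */
        : "r"(x)       /* input: a register the block reads */
        : "t0");       /* clobber: t0 is overwritten, keep no live values there */
    return r;
#else
    return x + 1;
#endif
}
```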
Andrius found all the places in MesLibC where Extended Asm was used and rewrote the assembly code to keep variables safe in the cases it was needed.
The other option was to add Extended Asm support to TinyCC, but we would need to add it to the Bootstrappable TinyCC and also upstream. This also means understanding TinyCC’s codebase very well and making the changes without errors, so we decided to simplify MesLibC, because that is easier to get right. We are probably going to need to do this later on anyway, but we’ll try to delay it as much as possible.
MesLibC main function arguments are not set properly
Following the previous problem with assembly, we later found that the input arguments of the main function, which come from the command line arguments, were not properly set by our MesLibC. Andrius also took care of that in 4f4a1174 in Mes.
This error was easier to find than others because by the time we hit it we already had a compiled TinyCC, so we just needed to fix simple things around it.
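The kind of work crt1 does here can be sketched in C. On Linux, at process entry the stack pointer points at argc, followed by the argv pointers, a NULL terminator, the envp pointers and another NULL (then the auxiliary vector). read_start_args is a hypothetical helper that receives that initial stack as an array of pointer-sized slots; MesLibC does the equivalent in assembly before calling main.

```c
#include <stdint.h>

struct start_args {
    int argc;
    char **argv;
    char **envp;
};

/* sp: initial stack pointer, viewed as pointer-sized slots (hypothetical helper). */
static struct start_args read_start_args(char **sp)
{
    struct start_args a;
    a.argc = (int)(uintptr_t)sp[0];   /* slot 0 holds argc */
    a.argv = sp + 1;                  /* argv[0] .. argv[argc - 1], then NULL */
    a.envp = a.argv + a.argc + 1;     /* skip argv and its NULL terminator */
    return a;
}
```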
TinyCC says __global_pointer$ is not a valid symbol
This is a small issue that was a headache for a while, but it turned out to be a very simple one.
In RISC-V there’s a symbol, __global_pointer$, described in the ABI, that the startup code loads into the gp (global pointer) register so accesses to nearby global data can be done relative to gp. But TinyCC had trouble parsing code around it, and it took us some time to realize it was the dollar sign ($) that was causing the issues at this point.
TinyCC does not process dollars in identifiers unless you specifically set a flag (-fdollars-in-identifiers) when running it. In the RISC-V case, that flag must always be active, because if it isn’t, __global_pointer$ can’t be processed.
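The lexer’s decision can be sketched like this. ident_length is a hypothetical helper, not TinyCC’s code: with the flag off, the identifier ends right before the $, so __global_pointer$ is read as __global_pointer followed by a stray $ that the parser then chokes on.

```c
#include <ctype.h>
#include <stddef.h>

static int is_ident_char(int c, int dollars_in_identifiers)
{
    if (isalnum((unsigned char)c) || c == '_')
        return 1;
    return dollars_in_identifiers && c == '$';
}

/* How many characters of s would be taken as one identifier
   (assumes s already starts with a valid identifier start character). */
static size_t ident_length(const char *s, int dollars_in_identifiers)
{
    size_t n = 0;
    while (s[n] != '\0' && is_ident_char(s[n], dollars_in_identifiers))
        n++;
    return n;
}
```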
We tried to set that flag on the command line, but we had other issues in the command line argument parsing (we found and fixed them later), so we just hardcoded it.
This issue is interesting because it’s an extremely simple problem, but its effect appears in weird ways and it’s not always easy to know where the problem is coming from.
Bootstrappable TinyCC’s casting issues
This one was really hard to fix.
When running our Bootstrappable TinyCC to build MesLibC we found this error:
cannot cast from/to void
We managed to isolate a piece of C code that was able to replicate the problem.[3]
long cast_charp_to_long (char const *i)
{
return (long)i;
}
long cast_int_to_long (int i)
{
return (long)i;
}
long cast_voidp_to_long (void const *i)
{
return (long)i;
}
int main(int argc, char* argv[]){
return 0;
}
Compiling this file raised the same issue, but then I realized I could remove two of the functions at the top and the error didn’t happen. Adding one of those functions back raised the error again.
I tried to change the order of the functions and the functions I chose to add, and I could reproduce it: if there were two functions it failed but it could build with only one.
Andrius found that the function type was not properly set in the RISC-V code generation and its default value was void, so it only failed when it compiled the second function.
Knowing that, we could take other architectures as a reference to fix this, and so we did.
See 6fbd1785.
Bootstrappable TinyCC’s long double support was missing
When I backported the RISC-V support to our Bootstrappable TinyCC I missed the long double support, and I didn’t realize it because I never tested large programs with it.
The C standard doesn’t define a size for long double (it just says it has to have at least the range and precision of double), but its size is normally set to 16 bytes.
All this is tricky in RV64, because it doesn’t have 16-byte registers: there, long double is a 128-bit IEEE quad-precision type whose arithmetic is done in software by helper functions, so it needs some extra support.
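What the standard does guarantee can be checked with the macros from float.h: long double must have at least the precision (mantissa digits) and exponent range of double. This small sketch (hypothetical function names) checks that guarantee and prints what the platform actually picked. On RV64 Linux we would expect 16 bytes and 113 mantissa bits (IEEE binary128); on x86-64 Linux it is also 16 bytes, but holding an 80-bit extended value with 64 mantissa bits.

```c
#include <float.h>
#include <stdio.h>

/* The standard's guarantee: long double is at least as precise as double
   and covers at least the same exponent range. */
static int long_double_at_least_double(void)
{
    return LDBL_MANT_DIG >= DBL_MANT_DIG && LDBL_MAX_EXP >= DBL_MAX_EXP;
}

/* What the platform actually chose (implementation-defined). */
static void print_long_double_info(void)
{
    printf("long double: %zu bytes, %d mantissa bits\n",
           sizeof(long double), LDBL_MANT_DIG);
}
```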
Before we fixed this, the following code:
long double f(int a){
return a;
}
Failed with:
riscv64-gen.c:449 (`assert(size == 4 || size == 8)`)
That’s because it was only expecting to use doubles (8 bytes) or floats (4 bytes).
In upstream TinyCC there were some commits that added long double support using, and I quote, a mega hack, so I just copied that support to our Bootstrappable TinyCC. See a7f3da33456b.
After this commit, some extra problems appeared with some missing symbols. These were link-time errors: TinyCC has the floating point helper functions needed for RISC-V defined in lib/lib-arm64.c, because it reuses the aarch64 code for them.
After this, we also compile and link lib-arm64.c, and we have long double support.
MesCC struct initialization issues
This one was a lot of fun. Our Bootstrappable TinyCC exploded with random issues: segfaults, weird branch decisions…
After tons of debugging Andrius found some values in structs were not set properly. As we don’t know TinyCC’s codebase really well, that was hard to follow and we couldn’t really tell where the values were coming from.
Andrius finally realized some structs were not initialized properly. Consider this example:
typedef struct {
int one;
int two;
} Thing;
Thing a = {0};
That’s supposed to initialize all fields in the Thing struct to 0, according to the C standard[4].
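A quick way to check this behaviour on a conforming compiler: the members that are not listed in the initializer must end up as 0, even for automatic (stack) variables. make_zeroed and make_partial are hypothetical test helpers; with the hardcoded 22 described below, the two field would have failed this check.

```c
typedef struct {
    int one;
    int two;
} Thing;

/* Only the first member is written explicitly; the standard requires the
   remaining members to be initialized as if they had static storage (to 0). */
static Thing make_zeroed(void)
{
    Thing a = {0};
    return a;
}

static Thing make_partial(void)
{
    Thing b = {5};   /* one = 5, two must still be 0 */
    return b;
}
```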
As a first solution we set struct fields manually to 0, to make sure they were initialized properly. See 29ac0f40a7afb.
After some debugging we found that the fields that were not explicitly set were initialized to 22. So I decided to go to MesCC and see if the struct initialization was broken.
This was my first dive into MesCC’s code, and I have to say it’s really easy to follow. It took me some time to read through it because I’m not that used to match, but I managed to find the struct initialization code.
What I found in MesCC is that there was a 22 hardcoded in the struct initialization code, probably coming from some debug code that was never removed. As no part of the x86 bootstrapping used that kind of initialization, or nothing relied on it, the error went unnoticed.
I set that to 0, as it should be, and we went on with our lives.
MesCC vs TinyCC size problems
The C standard does not set an exact size for integers. It only sets minimum ranges and relative sizes: short has to be shorter than or equal to int, int has to be shorter than or equal to long, and so on (int only has to be at least 16 bits wide). If your platform wants, char, short and int can all be 32 bits wide, and that’s ok for the C standard.
TinyCC’s RISC-V backend was written under the assumption that int is 32 bits wide. You can see this happening in riscv64-gen.c, for example, here:
EI(0x13, 0, rr, rr, (int)pi << 20 >> 20); // addi RR, RR, lo(up(fc))
The bit shifting there keeps only the lower 12 bits of the pi variable, sign-extended, which is what addi needs as its signed 12-bit immediate.
This code’s behavior might be different from one platform to another. It only works if int is exactly 32 bits wide: with a 16-bit int, shifting by 20 would even be undefined behavior, and with a wider int, more than 12 bits survive.
In our case, MesCC used the whole register width, 64 bits, for temporary values, so the lowest 44 bits were kept and the next assertion, which checked that the immediate fit in 12 bits, didn’t pass.
This is a huge problem, as most of the code in the RISC-V generation is written using this style.
There are other ways to do the same thing (pi & 0xFFF plus a sign extension, maybe?) in a more portable way, but we don’t know why upstream TinyCC decided to do it this way. Probably because GCC (and TinyCC itself) use 32-bit integers, but they didn’t handle other possible cases, like the one we had here with MesCC.
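The difference can be reproduced with explicit widths. These are hypothetical helpers: lo12_shift32 mimics the upstream expression with a 32-bit int, lo12_shift64 shows what the same shifts do on 64-bit temporaries, and lo12_portable is one width-independent alternative. Note that pi & 0xFFF alone is not quite enough, because addi takes a signed 12-bit immediate, so bit 11 has to be sign-extended. The shift versions rely on GCC and Clang doing an arithmetic right shift of negative values, which the standard leaves implementation-defined.

```c
#include <stdint.h>

/* Upstream’s trick with a 32-bit int: keeps the low 12 bits, sign-extended. */
static int32_t lo12_shift32(int32_t pi)
{
    return (int32_t)((uint32_t)pi << 20) >> 20;
}

/* The same shifts on 64-bit temporaries keep the low 44 bits instead. */
static int64_t lo12_shift64(int64_t pi)
{
    return (int64_t)((uint64_t)pi << 20) >> 20;
}

/* Width-independent: mask the low 12 bits, then sign-extend bit 11 by hand. */
static long lo12_portable(long pi)
{
    long lo = pi & 0xFFF;
    return lo >= 0x800 ? lo - 0x1000 : lo;
}
```

With pi = 0x12345FFF, the 32-bit and portable versions both give -1, while the 64-bit version returns 0x12345FFF unchanged, which is far outside the 12-bit range the assertion expects.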
In any case, this made us rethink MesCC, dig into how its integers are defined, how to change them to be compatible with TinyCC and so on, but I finally decided to add casts in the middle to make sure all this was compiled as expected.
It was a good reason to make us re-think MesCC’s integers, but it took a very long time to deal with, time that could have been better used on something else. Now we have all become paranoid about integers, and we still think some extra errors will arise from them in the future. Integers are hard.
MesCC add support for signed shifting
Integers were on our minds for a long time, as described in the previous section, but I didn’t talk about signedness there.
Following one of the crazy errors we had in TinyCC, I somehow realized (I don’t remember how!) that we were missing signed shifting support in MesCC. I think I found this while reviewing the code MesCC was outputting: I spotted some bit shifts done using unsigned instructions for signed values and started digging in MesCC to find out why. I finally realized that there was no support for that: the shift operation wasn’t selected depending on the signedness of the value being shifted.
Let’s see this with an example:
signed char a = 0xF0;
unsigned char b = 0xF0;
// What is this? (Answer: -1, all bits set: 0xFF if stored back in a signed char)
a >> 4;
// And this? (Answer: 0x0F => 15)
b >> 4;
In the example you can see the shifting operation does not work the same way if the value is signed or not. If you always use the unsigned version of the >> operation, you don’t get the results you expect. Signs are also hard.
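Here is the same example as runnable C, plus what a logical (unsigned) shift does to a 64-bit register holding -16, which is roughly what picking srl instead of sra produces on RV64. Right-shifting a negative value is implementation-defined in C; GCC and Clang use an arithmetic shift, and the expected values below assume that. The function names are hypothetical.

```c
#include <stdint.h>

static int shift_signed(void)
{
    signed char a = (signed char)0xF0;   /* -16 on two's complement targets */
    return a >> 4;                       /* promoted to int, arithmetic shift: -1 */
}

static int shift_unsigned(void)
{
    unsigned char b = 0xF0;              /* 240 */
    return b >> 4;                       /* logical shift: 15 */
}

/* The bug: a logical shift of the sign-extended 64-bit register
   fills the top bits with zeros instead of copies of the sign bit. */
static uint64_t shift_signed_as_logical(void)
{
    int64_t reg = -16;
    return (uint64_t)reg >> 4;           /* 0x0FFFFFFFFFFFFFFF */
}
```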
In this case, like in many others, the fix was easier than realizing what was going wrong. I just added support for the signed shifting operation, not only for RISC-V but for all architectures, and I added the correct signedness check to the shifting operation to select the correct instruction. The patch (see 88f24ea8 in Mes) is very clean and easy to read, because MesCC’s codebase is really well ordered.
EDIT: Someone on the web noted I called the bit-shift operations rotation operations. I normally use both words interchangeably, but it’s true they don’t mean the exact same thing. In a shift, the bits that go out of the register are lost; in a rotation, they come back in from the other side. I edited the article to use the correct word.
MesCC switch/case falls-back to default case
In the early bootstrap runs, our Bootstrappable TinyCC did weird things.
After many debugging sessions we realized the switch statements in riscv64-gen.c, more specifically in gen_opil, were broken. The fall-throughs in the switch were automatically directed to the default case. Weird!
MesCC has many tests, so I read all the ones related to switch statements, and the ones that handled fall-throughs were all falling through to the default case, so our weird behavior wasn’t tested.
I added tests for our case, and while reading the disassembly of simple examples I realized what the problem was.
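A test in the spirit of the ones we added (a hypothetical example, not one of the actual Mes tests) needs a match in one case that falls through into the following ones, so that a compiler that re-evaluates the next clause ends up in default:

```c
static int classify(int x)
{
    int r = 0;
    switch (x) {
    case 1:
        r += 1;
        /* fall through */
    case 2:
        r += 10;
        /* fall through */
    case 3:
        r += 100;
        break;
    default:
        r = -1;
    }
    return r;
}
```

A correct compiler gives 111 for classify(1), 110 for classify(2), 100 for classify(3) and -1 for anything else; with the bug described here, classify(1) would end up in default.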
Each of the case blocks has two parts: the clause that checks if the value of the expression is the one of the case, and the body of the case itself.
The switch statement generation was doing some magic to deal with case blocks, but it was failing to deal with complex fall-through schemes because the clause of the target case block was always run. That made the code fall to the default case, as the clause was always false: the case that matched was the one that fell through.
There were some problems fixing this, as NyaCC (MesCC’s C parser) returns case blocks as nested when they don’t have a break statement:
(case testA
(case testB
(case testC BODY)))
Instead of dealing with that nesting, I decided to flatten the case blocks, giving the outer ones empty bodies. This way we can deal with the structure in a simpler way.
((case testA (expr-stmt))
(case testB (expr-stmt))
(case testC BODY))
Once this is done, I expanded each case block into a jump that jumps over the clause, the clause itself, and then its body. This way, a fall-through doesn’t re-evaluate the clause, as it doesn’t need to. The generated code looks like this in pseudocode:
;; This doesn't have the jump because it's the first
CASE1:
testA
CASE1_BODY:
...
goto CASE2_BODY
CASE2:
testB
CASE2_BODY:
...
goto CASE3_BODY
CASE3:
testC
CASE3_BODY:
...
If one of the cases has a break, it’s treated as part of its body, and it will end the execution of the switch statement normally, with no fall-through.
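The same layout can be written by hand in C with goto, which makes it easy to check that a fall-through never re-evaluates the next clause. lowered_switch is a hypothetical hand-lowered version of a switch with two fall-throughs and a break; it follows the pseudocode above, but it is not MesCC’s actual output.

```c
/* Hand-lowered version of:
 *   switch (x) { case 1: r += 1; case 2: r += 10; case 3: r += 100; break;
 *                default: r = -1; }
 * Each failed clause jumps to the next clause; each body ends by jumping
 * over the next clause straight into the next body (the fall-through). */
static int lowered_switch(int x)
{
    int r = 0;
    /* CASE1: first case, so no jump over its clause is needed */
    if (x != 1) goto CASE2;
    r += 1;
    goto CASE2_BODY;        /* fall through without testing x == 2 */
CASE2:
    if (x != 2) goto CASE3;
CASE2_BODY:
    r += 10;
    goto CASE3_BODY;        /* fall through without testing x == 3 */
CASE3:
    if (x != 3) goto DEFAULT;
CASE3_BODY:
    r += 100;
    goto END;               /* break */
DEFAULT:
    r = -1;
END:
    return r;
}
```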
This results in simpler case block control. The previous approach dealt with nested case blocks and tried to be clever about them, but unsuccessfully. The best thing about this commit is that most of the cleverness was simply removed in favor of a simple solution (flatten all the things!).
It wasn’t that easy to implement, but I first built a simple prototype and Janneke’s Scheme magic made my approach usable in production.
All this was added to Mes’s codebase in several commits, as we needed some iterations to make it right. 22cbf823582 has the base of this change, but there were a few more iterations in Mes.
Bootstrappable TinyCC problems with GOT
The Global Offset Table (GOT) is a table of addresses that position-independent and dynamically linked code uses to reach global symbols. Our Bootstrappable TinyCC segfaulted because it was generating an empty GOT.
Andrius debugged upstream TinyCC alongside ours and realized there was a missing check in an if statement. He fixed it in f636cf3d4839d1ca.
The problem with this kind of error is that TinyCC’s codebase is really hard to read. It’s a very small compiler, but it’s not obvious how things are done in it, so we had to spend many hours in debugging sessions that went nowhere. If we had a compiler that was easier to read and change, it would have been way simpler to fix and we would have had a better experience with it.
Bootstrappable TinyCC generates wrong assembly in conditionals
We spent a long time debugging a bug I introduced during the backport when I tried to undo some optimization upstream TinyCC applied to comparison operations.
Consider the following code:
if ( x < 8 )
whatever();
else
whatever_else();
Our Bootstrappable TinyCC was unable to compile this code correctly; instead, it output code that always took the same branch, regardless of the value in x.
In TinyCC, a conditional like if (x < CONSTANT) gets special treatment, and it’s converted to something like this pseudoassembly:
load x to a0
load CONSTANT to a1
set a0 if less than a1
branch if a0 not equal 0 ; Meaning it's `set`
This behaviour uses the a0 register as a flag, emulating what other CPUs use for comparisons. RISC-V doesn’t need that, but it’s still done here, probably for compatibility with other architectures. In RISC-V it could look like this:
load x to a0
load CONSTANT to a1
branch if a0 less than a1
You can easily see the branch “instruction” does a different comparison in one case versus the other. In the top one it checks if a0 is set, and in the other it checks if a0 is smaller than a1.
TinyCC handles this case in a very clever way (maybe too clever?). When it emits the set a0 if less than a1 instruction, it replaces the current comparison operation with not equal and replaces the CONSTANT with a 0. That way, when the branch instruction is generated, it inserts the correct clause.
In my code I forgot to replace the comparison operator, so the branch checked if a0 is less than 0, which was always false, as the set operation writes a 0 or a 1 and neither of them is less than 0.
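The two lowerings can be modelled in plain C. These hypothetical helpers only model the semantics, not TinyCC’s code: slt produces 0 or 1 like RISC-V’s set-if-less-than, and each function returns whether the branch to the then part is taken. With the comparison correctly rewritten to not equal 0 the result depends on x; with the leftover less than 0 it never does, which is the bug described above.

```c
/* set a0 if less than a1: always 0 or 1 */
static long slt(long a, long b)
{
    return a < b ? 1 : 0;
}

/* Correct: after slt, the comparison was rewritten to "not equal 0". */
static int branch_taken_fixed(long x, long constant)
{
    long a0 = slt(x, constant);
    return a0 != 0;          /* bne a0, zero */
}

/* Buggy: the original "less than" operator was kept, now against 0. */
static int branch_taken_buggy(long x, long constant)
{
    long a0 = slt(x, constant);
    return a0 < 0;           /* blt a0, zero: never true for 0 or 1 */
}
```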
The commit 5a0ef8d0628f719 explains this in a more technical way, using actual RISC-V instructions.
This was also hard to fix, because TinyCC’s variable names (vtop->c.i) are really weird and they are used for many different purposes.
Support for variable length arguments
In C you can define functions that take a variable number of arguments. In RISC-V, those arguments are passed in registers, while in some other architectures (like i386) they are passed on the stack. This means the RISC-V case is a little bit more complex to deal with, and needs special treatment.
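From the C side, variadic functions look the same everywhere; the ABI differences hide behind stdarg.h. As we understand the RISC-V psABI, va_list there is a plain pointer: a variadic callee stores the remaining argument registers next to any stack-passed arguments, so va_arg can walk them in memory. That is exactly the part the compiler and the va_* definitions have to agree on. sum_longs is just a hypothetical example:

```c
#include <stdarg.h>

/* Hypothetical variadic function: adds `count` long arguments. */
static long sum_longs(int count, ...)
{
    va_list ap;
    long total = 0;

    va_start(ap, count);          /* point ap at the first variadic argument */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, long);
    va_end(ap);
    return total;
}
```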
Andrius realized our Bootstrappable TinyCC had issues with variable length arguments, especially in the most famous function that uses them: printf. He traced the problem to the arguments not being set properly.
Reading upstream TinyCC we found they use a really weird system for the defines that deal with this. They have a header file, include/tccdefs.h, which is included in the codebase, but also processed by a tool that generates strings that are later injected at execution time in TinyCC.
This was too much for us so we just extracted the simplest variable arguments definitions for RISC-V and introduced that in MesLibC and our Bootstrappable TinyCC.
Extra: files generated with no permissions
The bootstrappable TinyCC built using MesCC generated files with no permissions, and Andrius found that this problem came from the variable length argument support definitions. So he fixed that, too[5].
The macro that defined va_start had broken pointer arithmetic. At the beginning he thought it was related to MesCC’s internals, but he later tested it with GCC and realized the problem was in the macro definition. That’s why the commit currently says “workaround” in its name, but it’s more than a workaround: it’s a proper fix. We are going to reword that, but it will happen after we release this post.
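One plausible way a broken va_start ends up creating files with no permissions (our assumption about the mechanism; the commit doesn’t spell it out): open() is a variadic function, and the mode of a newly created file is only read, with va_arg, when O_CREAT is set. If va_start points to the wrong place, that mode can come out as 0. open_mode_from_args is a hypothetical helper mimicking that part of a libc open() wrapper:

```c
#include <fcntl.h>
#include <stdarg.h>

/* Hypothetical helper: fetch the optional mode argument the way an open()
   wrapper would, only when O_CREAT is present in flags. */
static int open_mode_from_args(int flags, ...)
{
    int mode = 0;

    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, int);
        va_end(ap);
    }
    return mode;
}
```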
MesLibC use signed char for int8_t
We already had a running Bootstrappable TinyCC compiled using MesCC when we stumbled upon this issue. Somehow, when assembling:
addi a0, a0, 9
The code was trying to read 9 as a register name, and failed to do it (of course). It was weird to realize that the following code (in riscv64-asm.c) was always taking the true branch of the if statement, even when asm_parse_regvar returned -1:
int8_t reg;
...
if ((reg = asm_parse_regvar(tok)) != -1) {
...
} else ...
I disassembled and saw something like this:
call asm_parse_regvar ;; Returns value in a0
reg = a0
a0 = a0 + 1
branch if a0 equals 0
This looks ok: it does some magic with the -1, but it makes sense anyway. The problem is that it didn’t branch, because a0 was 256 even when asm_parse_regvar returned -1 (the 8-bit reg held 255, and 255 + 1 is 256).
During some of the int-related problems someone on the Fediverse told me that char’s default signedness is not defined by the C standard (it’s implementation-defined). I read MesLibC and, exactly: int8_t was defined as an alias to char.
In RISC-V char is unsigned by default (don’t ask me why), but we are used to x86, where it’s signed by default. Just saying char is not portable.
Replacing:
typedef char int8_t;
With:
typedef signed char int8_t;
Fixed the issue.
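The comparison can be reproduced on any host by spelling out both signednesses. These hypothetical helpers model the if in riscv64-asm.c: the 8-bit variable receives -1, and the result of the assignment is compared with -1 after integer promotion.

```c
/* What int8_t was on RISC-V: plain char, which is unsigned there. */
static int regvar_found_unsigned(int parsed)
{
    unsigned char reg;
    return (reg = parsed) != -1;   /* -1 becomes 255, and 255 != -1: always "found" */
}

/* What int8_t must be. */
static int regvar_found_signed(int parsed)
{
    signed char reg;
    return (reg = parsed) != -1;   /* -1 stays -1: correctly "not found" */
}
```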
From this you can learn several things:
- Don’t assume char’s signedness in C
- If you design a programming language, be consistent with your decisions. In C, int is always signed int, but chars don’t act like that. Don’t do this.
MesLibC Implement setjmp and longjmp
Those who are not that versed in C (like me, before we found this issue) won’t know about setjmp and longjmp, but they are, simplifying a lot, like a goto you can use from any part of the code (as long as the function that called setjmp hasn’t returned yet). setjmp needs a buffer and stores the state of the program in it; longjmp sets the state of the program to the values in the buffer, so it jumps back to the position stored by setjmp.
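A minimal example of the pair, using the standard setjmp.h (run and fail_deep are hypothetical names): run saves the state with setjmp, and fail_deep jumps back to it from inside another function. The standard restricts where a setjmp call may appear (for example, compared with a constant as the whole controlling expression of an if), which is why the error code travels through a static variable here.

```c
#include <setjmp.h>

static jmp_buf env;
static int last_code;

static void fail_deep(int code)
{
    last_code = code;
    longjmp(env, 1);             /* restore the state saved in env */
}

static int run(int code)
{
    if (setjmp(env) == 0) {      /* first return: state saved */
        fail_deep(code);
        return -1;               /* not reached */
    }
    return last_code;            /* second return, caused by longjmp */
}
```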
Both functions are part of the C standard library and they need specific support for each architecture because they need to know which registers are considered part of the state of the program. They need to know how to store the program counter, the return address, and so on, and how to restore them.
In their simplest form they are a set of stores in the case of setjmp and a set of loads in the case of longjmp.
In RISC-V they only need to store the s* registers (along with ra and sp), as they are the callee-saved ones, not treated as temporary. It’s simple, but it needs to be done, and MesLibC didn’t have it for RISC-V yet.
Andrius is not convinced by our commit here, and I agree with his concerns. We added the full setjmp and longjmp implementations directly stolen from (ahem, inspired by) the ones in Musl[6], but they also have floating point register support, using instructions that are not implemented in TinyCC yet. This is going to be a problem in the future because later iterations will try to assemble instructions they don’t actually understand.
There are two (or three) possible solutions here. The first is to remove the floating point instructions for now (another flavor of this solution is to hide them under an #ifdef). The second is to implement the floating point instructions in TinyCC’s RISC-V assembler, which sounds great but forces us to upstream the changes, and that process may take long, so we’d need to patch it in our bootstrapping scripts until it happens.
We just added the #ifdefs, because our code is full of them anyway, and sent it to Mes: 0e2c5569.
More
Those are mostly the coolest errors we needed to deal with, but we stumbled upon a lot more.
Before this effort started, Andrius added support for 64-bit instructions in Mes and fixed some issues 64-bit architectures had in M2.
I found a bug in Guix shell (it’s still open) and had to fix some ELF headers in MesCC-generated files because objdump and gdb refused to work on them.
Andrius also found issues with weak symbols in MesLibC that were triggered because TCC didn’t support them; thankfully upstream TCC had that issue fixed, and we just cherry-picked the fix for the win.
He even had the energy to test all this on real RISC-V hardware we specifically acquired for this task.
There are many more things to tell, but this is already getting too long, and if I continue writing we’ll probably end up fixing some more stuff.
In the end, a project like this is like hitting your head against a wall until one of them breaks. Sometimes it feels like the head did, but it’s all good.
Reproducing what we did
All we did means nothing if you can’t reproduce it. We provide two ways to reproduce this process: live-bootstrap and Guix.
Both provide similar results, but there are some high-level differences worth mentioning now.
Compared with live-bootstrap, using Guix helps because it reuses the previous steps if they didn’t change. This results in shorter waits once Mes is sorted out.
On the other hand, I’ve had issues with failed builds in Guix (in emulated systems). It was hard to jump inside the build container and play around, so the development cycle suffered a lot. In live-bootstrap, if you are good with bwrap, you can jump in and tweak things with no issues.
For those who enjoy digging into the code and trying to follow the process, I recommend following live-bootstrap’s scripts. The directory structure is a little bit confusing, but the scripts are very plain and linear. The ones in the Guix process come from previous bootstrap efforts and they are designed to do many things automagically, which makes them hard to follow.
Using live-bootstrap
Andrius is part of the live-bootstrap effort, and he’s doing all the scripting there to keep the process reproducible.
Live-bootstrap is…
An attempt to provide a reproducible, automatic, complete end-to-end bootstrap from a minimal number of binary seeds to a supported fully functioning operating system.
That’s the official description of the project. From a more practical perspective, it’s a set of scripts that build the whole operating system from scratch, depending on few binary seeds.
That’s not very different from what Guix provides from a bootstrapping perspective. Guix is “just” an environment where you can run “scripts” (the packages define how they are built) in a reproducible way. Of course, Guix is way more than that, but if we focus on what we are doing right now it acts like the exact same thing.
NOTE: live-bootstrap’s project description is a little bit outdated; its comparison with Guix contains old information. For more up-to-date information about Guix’s bootstrapping process, I suggest reading this page of the Guix manual: https://guix.gnu.org/manual/devel/en/html_node/Full_002dSource-Bootstrap.html
Although they are very different projects, at a practical level the main difference between them is that live-bootstrap is probably easier for you to test if you are working on any GNU/Linux distribution[7].
If you want to reproduce this exact point in time you only need to use my fork
of live-bootstrap, branch
riscv-tcc-boot
. I also made a tag on it, self-hosted-tcc-rv64
, to make it
easier to remember when was this post released. Andrius made all the magic to
set that process to take all the inputs from Mes and TinyCC from the correct tag.
Clone the repository, set up the dependencies and run this (if you are not on a RISC-V host, you need to configure Qemu and binfmt):
./rootfs.py --bwrap --arch riscv64 --preserve
That should, after a long time, reach a point where there’s a properly compiled bootstrappable TinyCC.
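If you are not on a RISC-V host, it may be worth confirming that Qemu is registered with binfmt_misc before launching a build that takes hours. The following is a minimal sketch, not part of live-bootstrap: the function name check_riscv64_binfmt is hypothetical, and it assumes the kernel’s binfmt_misc filesystem is mounted at /proc/sys/fs/binfmt_misc and that the handler’s name contains riscv64 (for example qemu-riscv64; names vary between distributions).

```shell
#!/bin/sh
# Hypothetical helper (not part of live-bootstrap): report whether this host
# can run riscv64 binaries, natively or through a binfmt_misc handler.
# Assumption: binfmt_misc lives at /proc/sys/fs/binfmt_misc and the handler
# name contains "riscv64" (e.g. "qemu-riscv64"); names differ per distribution.
check_riscv64_binfmt() {
    binfmt_dir=/proc/sys/fs/binfmt_misc

    # A native RISC-V 64-bit host needs no emulation at all.
    if [ "$(uname -m)" = "riscv64" ]; then
        echo "native: riscv64 host, no emulation needed"
        return 0
    fi

    # The "status" file only exists when binfmt_misc is mounted.
    if [ ! -e "$binfmt_dir/status" ]; then
        echo "missing: binfmt_misc is not mounted"
        return 1
    fi

    found=0
    for entry in "$binfmt_dir"/*riscv64*; do
        # With no match, the pattern is left unexpanded; skip it.
        [ -e "$entry" ] || continue
        found=1
        echo "found: $entry"
        # First line: enabled/disabled; second line: the interpreter path.
        sed -n '1,2p' "$entry"
    done

    if [ "$found" -eq 0 ]; then
        echo "missing: no riscv64 handler registered"
        return 1
    fi
    return 0
}

check_riscv64_binfmt || echo "configure Qemu and binfmt before running rootfs.py"
```

How to register the handler when it is missing depends on your distribution, so the sketch only reports the state instead of trying to fix it.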
Using Guix for a reproducible environment
I made a Guix recipe that can replicate the whole process, too. It took me a long time to make it work, but it finally does.
Using my TCC fork, reproducing this should be easy for people versed in Guix.
There’s a guix folder with some files (most of them broken, not gonna lie), but there are two you should pay attention to:
- channels.scm stores the state of my Guix checkout so you can reproduce it in the future using guix time-machine. At the moment it doesn’t feel necessary, but if something fails when you try it, please refer to that.
- commencement.scm is an edited copy of the Guix bootstrapping process, taken directly from gnu/packages/commencement.scm in Guix’s codebase. I patched it to make it work for RISC-V, using some more modern commits in the dependencies.
To reproduce all our work in Guix, you just need to build the tcc-boot0 package from the commencement.scm file using riscv64-linux as your --system. I’m a nice guy, so I added a script there you can use for this; just run:
./tcc-boot0-from-source.sh
And that should build the whole thing. It takes hours; you have been warned.
The script also passes --no-grafts (thanks, Efraim), because with grafts enabled it compiles the world from scratch (curl, X11… not good).
If you just want to build mes-boot as an intermediate step, I also made a script for that:
./mes-boot-from-source.sh
Both scripts load variables from the provided commencement.scm module. The module is not complex if you are used to Guix, but it calls some complex shell scripts in both Mes and TinyCC to do the build. Those contain all the magic.
Conclusions
Of course, now that they are fixed, the problems look easy and simple. This blog post doesn’t really do justice to the countless debugging hours and all the nights we, Andrius and I, spent thinking about where the issues could be coming from.
The debugging setup wasn’t as good as you might imagine. The early steps of the bootstrap don’t have the debug symbols a “normal” userspace program would have. In many cases, function names were all we had.
I have to thank my colleague Andrius here because he did a really good debugging job and provided me with small reproducers that I could then fix. Most of the time he made the assist and I scored the goal.
He also did a great job with the testing, which I couldn’t do because I was struggling with Guix from the early days, trying to make the compilers find the header files and libraries.
On the emotional side, it is also a great improvement to have someone to rely on. Andrius, Janneke and I worked well as a team and supported each other when our faith started to crumble. And believe me, it does crumble when a new bug appears right after you fixed one that took you a week. There were times this summer when I thought we would never reach this point.
It’s also worth mentioning that the bootstrapping process is extremely slow: it takes hours. This kills responsiveness and makes testing way harder than it should be. Not to mention that we are working on a foreign architecture, which has its own problems too.
If you want to take some lessons from something like this, here is a list of suggestions:
- The simplest error can take ages to debug if your code is crazy enough.
- Don’t be clever. It sets a very high standard for your future self and people who will read your code in the future.
- I guess we can summarize the previous two points in one: If we could remove TinyCC from the chain, we would. It’s a source of errors and it’s hard to debug. The codebase is really hard to read for no apparent reason.
- When build times are long, small reproducers help.
- Add tests for each new case you find.
- Don’t trust, disassemble and debug.
- Be careful with C and standards and undefined behavior.
- Integers are hard. Signedness makes them harder.
- Being surrounded by the correct people makes your life easier.
Also, as a personal note, I noticed I’m a better programmer since the previous post in this series. I feel way more comfortable with complex reasoning and even with writing new programs in other languages, even though I spent almost no time coding anything from scratch. It’s as if dealing with these kinds of issues in the internals gives you a level of awareness that is useful in a more general way than it looks. Crazy stuff.
If you can, try to play with the internals of things from time to time. It helps. At least it helped me.
What is next?
Now that we have a fully featured Bootstrappable TinyCC, we need to decide what to do next.
In the short term, all this has to be released in the original projects: Mes, M2, and so on. That’s the easy part, as everything has proved to be ready.
In the mid term, it’s not very clear what to do first. We suspect we’ll need upstream TinyCC for the next steps, because we need many different tools to continue with the bootstrapping chain, and the Bootstrappable TinyCC might not be enough to build them. On the other hand, when we go for a standard library, we’ll miss the extended assembly support we already mentioned. There’s some uncertainty about the next step.
The long term is pretty clear, though: the goal is GCC. First GCC for C and then for C++, so we are able to build GCC 7.5, which should enable the rest of the chain pretty easily (famous last words). I anticipate we are going to have problems with GCC (I know this because I left them there last time), so we’ll need to fix those, too. Once that is done, we will use GCC to compile more recent versions of GCC until we compile the world.
That’s more or less the description of what we will do in the next months.
And this is pretty much it. I hope you learned something new about C or the bootstrapping process, or at least had a good time reading this wall of text.
We’ll try to work less for the next one, but we can’t promise that. 😉
Take care.
1. There are many rounds. Like 7 or so. ↩
2. So it can compile itself again and again, but who would want to do that? ↩
3. This is how we managed to fix most of the problems in our code: make a small reproducer we can test separately so we can inspect the process and the result easily. ↩
4. You can see an explanation in the (1) case at cppreference.com ↩
5. He is like that. ↩
6. Yo, if it’s free software it’s not stealing! Please steal my code. Make it better. ↩
7. If you run it in Guix or in a distribution that doesn’t follow FHS, you’d probably need to touch the path of your Qemu installation or be careful with the options you send to the rootfs.py script. ↩